Matthew Talluto
22.02.2022
Under the null hypothesis that \(x\) and \(y\) are independent:

\[ pr(y|x) = pr(y) \]
\[ \mathrm{cov}_{xy} = \frac{\sum_{i=1}^n \left (x_i - \bar{x} \right) \left(y_i - \bar{y} \right)}{n-1} \]
\[ \begin{aligned} \mathrm{cov}_{xy} & = \frac{\sum_{i=1}^n \left (x_i - \bar{x} \right) \left(y_i - \bar{y} \right)}{n-1} \\ r_{xy} & = \frac{\mathrm{cov}_{xy}}{s_x s_y} \end{aligned} \]
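As a sanity check, these formulas can be computed by hand in R and compared against the built-in `cov` and `cor` (using arbitrary example vectors):

```r
## arbitrary example data
x = c(2.1, 3.4, 5.0, 6.2, 7.7)
y = c(1.0, 2.2, 2.9, 4.4, 5.1)
n = length(x)

## covariance by the formula above
cov_xy = sum((x - mean(x)) * (y - mean(y))) / (n - 1)

## Pearson correlation: covariance scaled by both standard deviations
r_xy = cov_xy / (sd(x) * sd(y))

all.equal(cov_xy, cov(x, y))  ## TRUE
all.equal(r_xy, cor(x, y))    ## TRUE
```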
\(H_0\): \(\rho = 0\)
\(H_A\): two-sided (\(\rho \ne 0\)) or one-sided (\(\rho > 0\) or \(\rho < 0\))
\(r\) has a standard error:
\[ s_{r} = \sqrt{\frac{1-r^2}{n-2}} \]

We can then compute a \(t\)-statistic:

\[ t = \frac{r}{s_r} \]

The p-value is the probability of observing a \(t\)-statistic at least this extreme under \(H_0\) (i.e., computed from the CDF of the \(t\) distribution with \(n-2\) degrees of freedom).
data(iris)
iris_set = subset(iris, Species == "setosa")
n = nrow(iris_set)
r = cor(iris_set$Sepal.Length, iris_set$Petal.Length)
r
## [1] 0.2671758
s_r = sqrt((1-r^2)/(n-2))
s_r
## [1] 0.1390906
t_val = r/s_r
t_val
## [1] 1.920876
2 * pt(t_val, n-2, lower.tail = FALSE) ## two-sided test
## [1] 0.06069778

Equivalent built-in test:
with(iris_set, cor.test(Sepal.Length, Petal.Length, alternative = "two.sided"))
##
## Pearson's product-moment correlation
##
## data: Sepal.Length and Petal.Length
## t = 1.9209, df = 48, p-value = 0.0607
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.01206954 0.50776233
## sample estimates:
## cor
## 0.2671758

- Correlation: `cor.test` (and `cor`)
- Proportions: `prop.test` or \(\chi^2\) test (`chisq.test`)

Correlation assumes nothing about the causal nature of the relationship between \(x\) and \(y\)
Often, we have a hypothesis that a variable is in some way functionally dependent on another
A simple linear regression describes/tests the relationship between a dependent variable \(y\) and an independent variable \(x\)
\[ \begin{aligned} \mathbb{E}(y|x) & = \hat{y} = \alpha + \beta x \\ & \approx a + bx \\ \\ \end{aligned} \]
\(y\) is not perfectly predicted by \(x\), so we must include an error (residual) term:
\[ \begin{aligned} y_i & = \hat{y_i} + \epsilon_i \\ & = a + bx_i + \epsilon_i\\ \\ \epsilon & \sim \mathcal{N}(0, s_{\epsilon}) \end{aligned} \]
\[ \begin{aligned} \hat{y_i} & = a + b x_i \\ y_i & = \hat{y_i} + \epsilon_i \\ \epsilon & \sim \mathcal{N}(0, s_{\epsilon}) \end{aligned} \]
We want to solve these equations for \(a\) and \(b\)
The “best” \(a\) and \(b\) are the ones that draw a line that is closest to the most points (i.e., that minimizes \(s_{\epsilon}\))
\[ s_{\epsilon} = \sqrt{\frac{\sum_{i=1}^n \left (y_i -\hat{y_i}\right )^2}{n-2}} \]
\[ \begin{aligned} \mathrm{ESS} & = \sum_{i=1}^n \left (y_i -\hat{y_i}\right )^2 \\ & = \sum_{i=1}^n \left (y_i - a - bx_i \right )^2 \end{aligned} \]
Solving for the minimum ESS yields:
\[ \begin{aligned} b & = \frac{\mathrm{cov_{xy}}}{s^2_x} \\ \\ & = r_{xy}\frac{s_y}{s_x}\\ \\ a & = \bar{y} - b\bar{x} \end{aligned} \]
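These closed-form solutions are easy to verify in R against `lm` (using the iris setosa subset from above as example data):

```r
data(iris)
iris_set = subset(iris, Species == "setosa")
x = iris_set$Sepal.Length
y = iris_set$Petal.Length

b = cov(x, y) / var(x)       ## slope: cov_xy / s^2_x
a = mean(y) - b * mean(x)    ## intercept: ybar - b * xbar

## equivalent via the correlation: b = r * s_y / s_x
b2 = cor(x, y) * sd(y) / sd(x)

coef(lm(y ~ x))  ## matches a and b
```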
The parameters have standard errors:
\[ s_a = s_{\hat{y}}\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum \left(x-\bar{x}\right)^2}} \]
\[ s_b = \frac{s_{\hat{y}}}{\sqrt{\sum \left(x-\bar{x}\right)^2}} \]
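These standard errors can likewise be computed from the formulas and checked against `summary(lm(...))` (a sketch using the iris setosa data from earlier):

```r
data(iris)
iris_set = subset(iris, Species == "setosa")
x = iris_set$Sepal.Length
y = iris_set$Petal.Length
n = length(x)

fit = lm(y ~ x)
yhat = fitted(fit)
s_resid = sqrt(sum((y - yhat)^2) / (n - 2))   ## residual standard error

ssx = sum((x - mean(x))^2)
s_a = s_resid * sqrt(1/n + mean(x)^2 / ssx)   ## SE of intercept
s_b = s_resid / sqrt(ssx)                     ## SE of slope

summary(fit)$coefficients[, "Std. Error"]     ## matches s_a and s_b
```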
\(\mathbf{H_0}\): The slope \(\beta\) = 0 (i.e., no variation in \(y\) is explained by variation in \(x\))
The ratio of variance explained to residual variance follows an \(F\) distribution
\[\begin{aligned} F & = \frac{MS_{model}}{MS_{err}} \\ MS_{model} & = \sum_{i=1}^n \left ( \hat{y}_i - \bar{y}\right)^2 \\ MS_{err} & = \frac{\sum_{i=1}^n \left ( y_i - \hat{y}_i \right)^2}{n-2} \end{aligned}\]

The coefficient of determination tells you the proportion of variance explained by the model:
\[ r^2 = \frac{\sum_{i=1}^n \left ( \hat{y_i} - \bar{y} \right )^2} {\sum_{i=1}^n \left ( y_i - \bar{y} \right )^2} \]
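This ratio of sums of squares can be verified against the \(R^2\) reported by `summary` (again using the iris setosa data as an example):

```r
data(iris)
iris_set = subset(iris, Species == "setosa")
fit = lm(Petal.Length ~ Sepal.Length, data = iris_set)
y = iris_set$Petal.Length
yhat = fitted(fit)

## proportion of variance explained
r2 = sum((yhat - mean(y))^2) / sum((y - mean(y))^2)
all.equal(r2, summary(fit)$r.squared)  ## TRUE
```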
## Data on sparrow wing lengths from Zar (1984)
dat = data.frame(age = c(3:6, 8:12, 14:17),
wing_length = c(1.4, 1.5, 2.2, 2.4, 3.1, 3.2, 3.2, 3.9, 4.1, 4.7, 4.5, 5.2, 5.0))
mod = lm(wing_length ~ age, data = dat)
summary(mod)
##
## Call:
## lm(formula = wing_length ~ age, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.30699 -0.21538 0.06553 0.16324 0.22507
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.71309 0.14790 4.821 0.000535 ***
## age 0.27023 0.01349 20.027 5.27e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2184 on 11 degrees of freedom
## Multiple R-squared: 0.9733, Adjusted R-squared: 0.9709
## F-statistic: 401.1 on 1 and 11 DF, p-value: 5.267e-10
confint(mod)
## 2.5 % 97.5 %
## (Intercept) 0.3875590 1.0386300
## age 0.2405308 0.2999272

par(mfrow=c(1, 2), bty='n')
## scale(residuals) produces **standardized** residuals
qqnorm(scale(residuals(mod)), pch=16)
qqline(scale(residuals(mod)), lty=2)
plot(dat$age, residuals(mod), xlab = "age", ylab = "residuals", pch=16)
abline(h = 0, lty=2)

We found a significant positive relationship between wing length and age (\(F_{1,11} = 401.1\), \(p < 0.001\), \(R^2 = 0.97\); Table 1)
| | estimate | st. error | 95% CI |
|---|---|---|---|
| intercept | 0.71 | 0.150 | 0.39, 1.03 |
| age | 0.27 | 0.013 | 0.24, 0.30 |
\[ \hat{y} = a + b_1x_1 + b_2x_2 \]
\[ \hat{y} = \mathbf{X}\mathbf{B} \]
\(\mathbf{B}\) is the parameter vector
\(\mathbf{X}\) is the design matrix
\[ \mathbf{X} = \begin{bmatrix} 1 & x_{1,1} & x_{2,1} & \dots & x_{k,1} \\ 1 & x_{1,2} & x_{2,2} & \dots & x_{k,2} \\ \vdots & \vdots & \vdots & \dots & \vdots \\ 1 & x_{1,n} & x_{2,n} & \dots & x_{k,n} \\ \end{bmatrix} \]
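In R, `model.matrix` constructs exactly this design matrix from a formula; a small illustration with two hypothetical predictors:

```r
## small example: n = 4 observations, two predictors
dat = data.frame(x1 = c(1, 2, 3, 4), x2 = c(0.5, 1.2, 0.7, 2.0))
X = model.matrix(~ x1 + x2, data = dat)
dim(X)       ## 4 rows, 3 columns: intercept column plus x1 and x2
colnames(X)  ## "(Intercept)" "x1" "x2"
```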
\[ \begin{aligned} \hat{y} & = \mathbf{X}\mathbf{B} \\ y &= \mathbf{X}\mathbf{B} + \epsilon \\ \epsilon & \sim \mathcal{N}(0, s_\epsilon) \end{aligned} \]
The equation is a linear system and can be solved with linear algebra by OLS, minimizing the sum of squared errors:
\[ \mathrm{min}: \sum \epsilon^2 = \sum \left (\mathbf{X}\mathbf{B} - y \right)^2 \]
Or in R we can just add terms to the formula ;-)
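The normal-equation solution \(\mathbf{B} = (\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top y\) can also be written out directly (a sketch with simulated data; internally `lm` uses a more numerically stable QR decomposition):

```r
set.seed(42)
n = 50
x1 = rnorm(n); x2 = rnorm(n)
y = 1 + 2*x1 - 0.5*x2 + rnorm(n)   ## simulated response

X = cbind(1, x1, x2)                  ## design matrix with intercept
B = solve(t(X) %*% X) %*% t(X) %*% y  ## OLS estimate via normal equations

cbind(B, coef(lm(y ~ x1 + x2)))       ## the two columns agree
```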
The significance of each parameter is tested with a \(t\)-test (reported by `lm`), or by checking whether the CI overlaps 0. Standardized coefficients allow comparing effect sizes among predictors measured on different scales:

\[ b_i^* = b_i \frac{s_{x_i}}{s_y} \]
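Standardized coefficients can be obtained either by this rescaling or by z-scoring the variables before fitting; both give the same answer (a sketch with simulated data, where the two predictors are on very different scales):

```r
set.seed(1)
n = 100
x1 = rnorm(n, sd = 2); x2 = rnorm(n, sd = 10)
y = 1 + 0.8*x1 + 0.1*x2 + rnorm(n)

fit = lm(y ~ x1 + x2)
## rescale raw slopes: b* = b * s_x / s_y
b_star = coef(fit)[-1] * c(sd(x1), sd(x2)) / sd(y)

## same result from regressing the scaled (z-scored) variables
fit_z = lm(scale(y) ~ scale(x1) + scale(x2))
coef(fit_z)[-1]  ## matches b_star
```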
set.seed(1515)
n_x = 20
n_obs = n_x*10
y = rnorm(n_obs) ## random y-variable
## 20! random x-variables, no relationship to y
x = matrix(rnorm(n_obs*n_x), ncol=n_x)
dat = data.frame(cbind(y, x))
colnames(dat) = c("y", paste0('x', 1:n_x))
## y ~ . means use every remaining variable in the data frame
mod = lm(y ~ ., data = dat)
summary(mod)
##
## Call:
## lm(formula = y ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.71527 -0.70863 -0.09133 0.78663 2.73273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0510218 0.0838196 0.609 0.5435
## x1 -0.0992465 0.0832114 -1.193 0.2346
## x2 -0.2188291 0.0850971 -2.572 0.0109 *
## x3 0.0100788 0.0828715 0.122 0.9033
## x4 -0.0504836 0.0802768 -0.629 0.5302
## x5 0.1498394 0.0897593 1.669 0.0968 .
## x6 0.0351077 0.0849153 0.413 0.6798
## x7 -0.0366265 0.0750078 -0.488 0.6259
## x8 -0.0007617 0.0832326 -0.009 0.9927
## x9 0.0001057 0.0851478 0.001 0.9990
## x10 0.0961346 0.0908632 1.058 0.2915
## x11 -0.1038063 0.0855094 -1.214 0.2264
## x12 -0.0371452 0.0825934 -0.450 0.6534
## x13 0.0299666 0.0847690 0.354 0.7241
## x14 0.0931749 0.0805042 1.157 0.2487
## x15 -0.0174443 0.0828449 -0.211 0.8335
## x16 -0.0980819 0.0869842 -1.128 0.2610
## x17 0.0593543 0.0827010 0.718 0.4739
## x18 -0.0296575 0.0768314 -0.386 0.6999
## x19 0.0342252 0.0888166 0.385 0.7004
## x20 -0.0040929 0.0917435 -0.045 0.9645
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.11 on 179 degrees of freedom
## Multiple R-squared: 0.08315, Adjusted R-squared: -0.0193
## F-statistic: 0.8116 on 20 and 179 DF, p-value: 0.6974

set.seed(1515)
n_x = 20
n_obs = n_x*2
y = rnorm(n_obs) ## random y-variable
## 20! random x-variables, no relationship to y
x = matrix(rnorm(n_obs*n_x), ncol=n_x)
dat = data.frame(cbind(y, x))
colnames(dat) = c("y", paste0('x', 1:n_x))
## y ~ . means use every remaining variable in the data frame
mod = lm(y ~ ., data = dat)
summary(mod)
##
## Call:
## lm(formula = y ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.37931 -0.53334 0.00515 0.30438 1.80457
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.36919 0.23534 1.569 0.1332
## x1 -0.37769 0.19326 -1.954 0.0655 .
## x2 -0.26935 0.20836 -1.293 0.2116
## x3 0.01585 0.18513 0.086 0.9327
## x4 -0.24950 0.18277 -1.365 0.1882
## x5 0.11977 0.23935 0.500 0.6226
## x6 -0.11829 0.22575 -0.524 0.6064
## x7 0.32152 0.23386 1.375 0.1852
## x8 0.28923 0.21415 1.351 0.1927
## x9 -0.40926 0.22296 -1.836 0.0821 .
## x10 -0.67912 0.24546 -2.767 0.0123 *
## x11 -0.05407 0.22265 -0.243 0.8107
## x12 0.06647 0.20304 0.327 0.7470
## x13 -0.05534 0.23823 -0.232 0.8188
## x14 -0.57639 0.25363 -2.273 0.0349 *
## x15 -0.08342 0.19241 -0.434 0.6695
## x16 -0.08631 0.22833 -0.378 0.7096
## x17 0.40355 0.20550 1.964 0.0644 .
## x18 0.60715 0.34957 1.737 0.0986 .
## x19 -0.27751 0.21418 -1.296 0.2106
## x20 -0.16230 0.21580 -0.752 0.4612
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9787 on 19 degrees of freedom
## Multiple R-squared: 0.6081, Adjusted R-squared: 0.1956
## F-statistic: 1.474 on 20 and 19 DF, p-value: 0.201

- Check residuals with qqnorm(residuals(mod)), qqline(residuals(mod)), and plot(mod)
- Check for collinearity: run cor on the predictor matrix, check for large correlations

Ignoring \(y\) for a moment, we can perform regressions of the \(x\) variables against each other:
\[ x_i = b_0 + b_1x_1 + \dots + b_kx_k + \epsilon \mathrm{~;~excluding~} x_i \]
\[ \mathrm{VIF}_i = \frac{1}{1-R^2_i} \]
The VIF quantifies how much collinearity inflates a coefficient's standard error: \(s_b = s_{b,\rho=0}\sqrt{\mathrm{VIF}}\), where \(s_{b,\rho=0}\) is the standard error that would be obtained if the predictors were uncorrelated
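The VIF is easy to compute by hand from the \(R^2\) of each predictor-on-predictors regression (the `vif` function in the `car` package automates this; here a manual sketch with simulated collinear predictors):

```r
set.seed(99)
n = 100
x1 = rnorm(n)
x2 = 0.9 * x1 + rnorm(n, sd = 0.3)  ## x2 strongly correlated with x1
x3 = rnorm(n)                       ## independent predictor

## VIF for x1: regress it on the other predictors
r2_1 = summary(lm(x1 ~ x2 + x3))$r.squared
vif_1 = 1 / (1 - r2_1)
vif_1  ## well above 1, reflecting the collinearity with x2
```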